
Data and Code


Author: EU Tax Observatory

Date: August 20, 2025

This document presents some general advice about how to collaborate, manage, store, and communicate on coding projects related to research.

Note

This is general advice! Practices depend on the project and the goal to be achieved, and keeping some flexibility is always a good idea!

Sharing code and versioning

While comments in coding scripts, data accessibility, and transparency are the key elements of full reproducibility, the first challenge is how to collaborate and maintain a consistent repository across the project's collaborators. Suppose two members work jointly on the code: keeping the scripts up to date for both is challenging. Or suppose you want to reproduce a figure that your script produced six months ago: how can you do it smoothly?

The key is to have a shared repository for all team members. That way, every change is transparent and easily accessible. In addition, you need to keep every version of the coding scripts. We present here two options for sharing and collaborating on code and summarize the pros and cons of each.

Warning

In this section, we do not talk about storing data. We only focus on code. Storing data depends mainly on the regulations that apply. For instance, if you have any confidential data, please make sure to use a data sharing solution that is compliant with the GDPR. For more details, see Section 7.

| Option | Pros | Cons |
|--------|------|------|
| Cloud option | Easy to implement; file versioning | Difficult to track changes |
| GitHub | File versioning; easy to track changes | Investment cost for all members; difficult to manage with the CASD |

Cloud option

The first option is the easiest to implement. Using a shared repository in the cloud, you can share the code with all team members, or even the entire project (including drafts, reports, and papers). With a solution such as Dropbox, Nextcloud PSE, or another service, it is easy to manage and works like a classic folder on your laptop.

Most cloud options currently offer access to previous versions of a file. If you want to go back to an earlier version of a file (say, to recover a very cool graph), you can do that. The only requirement is to configure your cloud repository so that it stores all file versions. This can be space-consuming, so make sure you have enough storage for your files.

However, the simplest solution is also a limited one. Suppose a team member changes something in the coding repository and you want to review it. You know which file is concerned (from its last modification date), but not where to look within the file: you have to read the whole file to identify what changed. If the same situation arises for multiple files, it becomes even worse.

Advantages
  • Easy to implement
  • Can keep all file versions
  • Does not require any technical skills
Drawbacks
  • Identifying changes within the code is tricky
  • It gets worse for repositories with numerous files

GitHub

GitHub addresses the main drawback of the cloud option: identifying changes within a repository. It records all changes between two versions of a file and lists every change between two updates. Hence, it is easy to track changes and check that everything is right. However, it requires some investment from all team members to learn to manage GitHub.

So, how do you use GitHub? You don't need to be a nerd with Terminal applications to manage a GitHub project (though you can do it that way if you want!). GitHub Desktop lets you manage such a project smoothly and provides an interface to easily monitor changes over time.

Let's talk about the main actions in a GitHub project. For setting up a repository, see here. Consider a repository with two team members:

---
config:
  theme: neutral
  max-width: 600
---

flowchart TB;
    A[Member 1] --> B(Project);
    C[Member 2] --> B;
Figure 1: Coding Project (First modification)

Let's say that Member 1 works on the project and makes some significant changes on their laptop. For now, those changes are not shared with Member 2: Member 1 has to commit and push them.

---
config:
  theme: neutral
  max-width: 600
---

flowchart TB;
    A[Member 1] -->|Commit| B(Project);
    C[Member 2] --> B;
Figure 2: Coding Project (Pushing the modification for the entire team)

Now, consider that the changes made by Member 1 are committed and pushed. Member 2 wants to work on the project and review the tracked changes. They have to pull the project from their laptop to see the changes. After that, both repositories are synchronized.

---
config:
  theme: neutral
  max-width: 600
---

flowchart LR;
    A[Member 1] --> B(Project);
    B --> |Pull| C[Member 2];
Figure 3: Coding Project (Pulling modifications for alternative members)
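The same commit/push/pull cycle can also be run from the command line. As a self-contained sketch (the file name and commit message are illustrative, and push/pull are shown as comments because they require a configured remote):

```shell
# work in a scratch directory so the sketch runs on its own;
# in a real project, the repository would be cloned from GitHub
cd "$(mktemp -d)"
git init -q .
git config user.email "member1@example.org"  # placeholder identity
git config user.name "Member 1"

# Member 1 edits a script and records the change locally
echo 'df <- read.csv("data.csv")' > 01_loading_data.R
git add 01_loading_data.R
git commit -q -m "Load raw data"

# Member 1 shares the change with the team:
#   git push origin main
# Member 2 later retrieves it on their own laptop:
#   git pull origin main

git log --oneline  # lists every recorded change with its identifier
```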

In addition, Member 2 can inspect the changes made by Member 1 in more detail, without reviewing all the code. Several tools offer this possibility, and GitHub Desktop provides clear highlighting, such as

Tracking changes
Note

Here, lines highlighted in green were added by Member 1, whereas lines highlighted in red were removed. The unchanged part of the coding file is uncolored (not applicable here).
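If you use Git from the command line instead of GitHub Desktop, `git diff` gives the same added/removed view. A minimal self-contained sketch (file contents are illustrative):

```shell
cd "$(mktemp -d)"
git init -q .
git config user.email "you@example.org"  # placeholder identity
git config user.name "You"

printf 'a <- 1\nb <- 2\n' > script.R  # first version of the file
git add script.R
git commit -q -m "initial"

printf 'a <- 1\nb <- 3\n' > script.R  # modify one line
git diff script.R  # removed lines are prefixed with -, added lines with +
```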

In the end, every member can trace back every change in the coding files using the commit number. This ensures that every version of the coding files remains available, so every member can regenerate any figure or table, whatever version it was produced with. To sum up, the project is composed of numerous revisions that can be identified by their number, such as

---
config:
  theme: neutral
---

gitGraph
    commit
    commit
    commit
    commit
    commit
    commit
Figure 4: Git

Hence, we advise you to record this number in your logfile whenever you commit a significant change to the project. It helps to monitor and follow the evolution of the code.
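Assuming Git is used from the command line, the identifier of the latest commit can be retrieved as follows (a self-contained sketch; the file and commit message are illustrative):

```shell
cd "$(mktemp -d)"
git init -q .
git config user.email "you@example.org"  # placeholder identity
git config user.name "You"

echo "analysis" > script.R
git add script.R
git commit -q -m "New matching-rate figure"

# short identifier of the latest commit, to paste into the logfile
git rev-parse --short HEAD
```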

Finally, the GitHub environment is not accessible from the CASD for security reasons. Within the CASD, the repository is shared between all team members, but there is no built-in system to track code changes between members. It is possible, however, to use Git locally (the software behind GitHub) to version the code in the same way and benefit from versioning and change highlighting.

Advantages
  • Identifies changes within files
  • Can keep all file versions
Drawbacks
  • Requires some technical investment from all team members
  • Not accessible from the CASD

Comments

Coding files can be hard to understand, especially for non-experts in the language under consideration. Moreover, data processing can be implemented with different packages, especially in R, which can make the code difficult to read for outsiders. For instance, take the following example.

df_data[, u_p := val / sur]
df_data <- df_data[u_p > 120 & u_p < 20000]
df_data[, sur := NULL]
lm(u_p ~ y_sale + h_type + l_mut, data = df_data)

gen u_p = val / sur
keep if u_p > 120 & u_p < 20000
drop sur
regress u_p y_sale i.h_type i.origin

If you are not an R expert, understanding such a chunk of code might be difficult. Hence, assessing whether there is a bug or a poor methodological choice is highly challenging. Now, let's look at the commented version of this script.

# we create a new column `u_p` that represents the housing unitary price: 
# housing value (`val`) divided by surface (`sur`)
df_data[, u_p := val / sur]

# we now filter observations based on the unitary housing price:
# we keep observations above 120 and under 20,000
df_data <- df_data[u_p > 120 & u_p < 20000]

# we remove the column surface
df_data[, sur := NULL]

# finally, we regress the unitary housing price on multiple variables:
# - y_sale: year of sale
# - h_type: housing type
# - l_mut: last date of mutation
# coefficients are obtained through OLS

lm(u_p ~ y_sale + h_type + l_mut, data = df_data)
* Create the new variable u_p
gen u_p = val / sur

* Keep observations where u_p is between 120 and 20000
keep if u_p > 120 & u_p < 20000

* Drop the variable sur
drop sur

* Run the linear regression
regress u_p y_sale i.h_type i.origin

Comments are extremely helpful for others to understand your code. As a result, they can more easily spot any coding mistakes or poor methodological choices.

Additionally, consider revisiting an old project six months or a year later. Your coding practices may have evolved, and you might have switched packages, making it more challenging to navigate through your code. This is particularly true for lengthy code files, which can be difficult to comprehend after being away from them for several months!

However, for straightforward code, comments are not mandatory, as they can clutter the reading of the code. Hence, commenting code is a balance between clearly explaining what is performed and keeping the code as clean as possible. Performing code reviews (see Section 8) helps to adjust code comments.

The logfile

To monitor projects, and more specifically long-term projects, keeping a logfile is highly advised. This file registers all modifications with information such as the date, the author, and a brief explanation. It helps to track the evolution of the project, its main changes, and each individual's contributions to the coding part of the research project. It is especially recommended for repositories using the cloud option rather than GitHub, as it helps identify changes and makes searching for a specific file version easier. Here is a snapshot of the logfile used for an ongoing project:


# To-do
- [ ] Make some descriptive statistics about tenure status
- [ ] Merging and comparison with Orbis database
- [ ] Need to assess the quality of changes within the BO register
- [ ] Launch a new collection for 2025
- [ ] Outcomes about rental eviction, housing maintenance, and renters' income to compute
- [ ] Access to the TVVI database

# Previous changes

## 25.05.06

- Code for the internal workshop is ok. RL 2db4c55 
- Figures are currently working. First push in a GitHub repo in the next weeks. RL 2db4c55
- Code review. RL 2db4c55

## 25.04.20

- New code to have a matching rate per percentile. RL 688eea8
- We also account for the individual shares and highlight that the 25% rule is a main limit for transparency. RL 688eea8
- New map for Paris. RL 688eea8
Note

The logfile is usually written in a lightweight, open format such as txt or Markdown, so it is easily readable and writable regardless of the OS.

The logfile might also contain a to-do section. Research projects are not always linear, and writing down future steps in this file is useful. Indeed, when we reopen a project, a look at the logfile provides a good overview of what has been achieved and what remains to be done.

Finally, sharing a logfile between contributors is a key element for coordinating your efforts. A look at the logfile lets all team members see which steps remain to be implemented, while also making it easy to monitor previous achievements. The authorship information makes communication easier, especially if there is a misunderstanding about parts of the code.

Data accessibility

The data used for the empirical analysis is key information to provide to ensure consistent and reproducible results. Hence, we need to detail the following information:

  • Data source: where can outsiders access the data?
  • Data version: for updated datasets, specify the version used in this project (especially for flow data)
  • Metadata

For the metadata, it is useful to document, for all team members who might join the project, what is stored in the main datasets. Ideally, provide a clear explanation of each column: what the observation unit is, and whether the column is custom-built or derived directly from the raw data. Specifying units is also of interest. Doing so is time-consuming (especially for a large dataset), but it helps newcomers a lot in understanding the data. If online documentation exists elsewhere, do not hesitate to link directly to it; in that case, however, make sure your columns keep the same names as in the original data.
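As a hypothetical sketch (column names taken from the commented example above; entries are illustrative), a metadata file in the project could look like:

```markdown
# Metadata: housing transactions dataset

Observation unit: one housing transaction.

| Column | Description                 | Unit   | Origin   |
|--------|-----------------------------|--------|----------|
| val    | housing transaction value   | EUR    | raw data |
| sur    | living surface              | m²     | raw data |
| u_p    | unitary price, `val / sur`  | EUR/m² | derived  |
```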

README file

The README is probably the first file an outside individual will open when accessing your research project. In your README file, you must put important information such as:

  • Title and authorship
  • The main objective of the research project
  • Information about how to access the data
  • The main software used in the project
  • The license of your code (in our case, mostly open, but it depends)
  • Any explanation that helps individuals understand your repository
Note

The README is always written in an open file format, mainly Markdown.

An example of a README file


# Name of the project

## Overview
Here, we discuss the objective of the project, a snapshot of the main conclusion, and potential redirection to the paper.

## Features of the code
List the big steps of making your code accessible. For instance:

- Data collection
- Data cleaning
- Filtering of the dataset
- Descriptive statistics
- Econometric analysis

## License
This project is licensed under the MIT License. See the LICENSE file in the repository for details.

## Contact
For any questions or feedback, please contact:

Your Name: 
GitHub: 

Organizing directory and files

Besides the scripts themselves, structuring the coding directory is also good practice. First, avoid a single script that does everything. Research projects can be large, including data loading, data management, descriptive statistics, and econometric analysis; one standalone file would be too large to be easily understood by team members or outsiders.

On the other hand, files spread everywhere are hard to navigate. Suppose you join an ongoing project with multiple files to handle: how do you know which one to execute first? The order of the scripts is a key element of full reproducibility. If you ran the econometric analysis before the filtering step, the results would be dramatically different. Hence, one file must aggregate everything and call each subscript.

We can call this script main_script and structure the code as follows (example from an ongoing project)

################################################################################
# INTREALES Project Code Preamble
################################################################################

# -----------------------------------------------------------------------------
# Project Information
# -----------------------------------------------------------------------------

# Author:  Author 1, Author 2, Author 3
# Title:   Code of super cool project
# Date:    2025-04-08
# Version: 1.0

# -----------------------------------------------------------------------------
# Load Necessary Libraries
# -----------------------------------------------------------------------------

# load all relevant packages for the analysis
source("init/packages.R")
theme_update(text = element_text(family = "serif")) # use a serif font in all figures
# -----------------------------------------------------------------------------
# Additional Setup or Configuration
# -----------------------------------------------------------------------------

output_table <- "output_code/table/" ## location of table outputs
output_figure <- "output_code/figure/" ## location of figure outputs
choice_w <- 16 # width of the output graphics (in inches)
choice_h <- 9 # height of the output graphics (in inches)

# -----------------------------------------------------------------------------
# Main Code
# -----------------------------------------------------------------------------

################################################################################

# Loading data -----------------------------------------------------------------

source("code/data/01_loading_data.R")
source("code/data/02_filtering_data.R")

# Descriptive statistics about the topic of interest ---------------------------

source("code/descriptive_statistics/01_summary_stat_sample.R")
source("code/descriptive_statistics/02_stat_observation_interest.R")

# Running an econometric analysis ---------------------------------------------

source("code/econometric_analysis/01_diff_in_diff.R")
source("code/econometric_analysis/02_robustness_checks.R")
source("code/econometric_analysis/03_placebo.R")
* ################################################################################
* INTREALES Project Code Preamble
* ################################################################################

* -----------------------------------------------------------------------------
* Project Information
* -----------------------------------------------------------------------------

* Author:  Author 1, Author 2, Author 3
* Title:   Code of super cool project
* Date:    2025-04-08
* Version: 1.0

* -----------------------------------------------------------------------------
* Load Necessary Libraries
* -----------------------------------------------------------------------------

* In Stata, we typically use `ssc install` or `net install` to install packages.
* For example, to install a package, you might use:
* ssc install package_name

* -----------------------------------------------------------------------------
* Additional Setup or Configuration
* -----------------------------------------------------------------------------

* Define output directories
global output_table "output_code/table/" // location of table outputs
global output_figure "output_code/figure/" // location of figure outputs

* Define graphics dimensions
global choice_w = 16 // width of the output graphics (in inches)
global choice_h = 9 // height of the output graphics (in inches)

* -----------------------------------------------------------------------------
* Main Code
* -----------------------------------------------------------------------------

* ################################################################################

* Loading data -----------------------------------------------------------------

do "code/data/01_loading_data.do"
do "code/data/02_filtering_data.do"

* Descriptive statistics about the topic of interest ---------------------------

do "code/descriptive_statistics/01_summary_stat_sample.do"
do "code/descriptive_statistics/02_stat_observation_interest.do"

* Running an econometric analysis ---------------------------------------------

do "code/econometric_analysis/01_diff_in_diff.do"
do "code/econometric_analysis/02_robustness_checks.do"
do "code/econometric_analysis/03_placebo.do"

The main script is quite simple to read, regardless of whether you are an R expert. The objective is to provide all the needed steps with some comments to understand each step. The script can be decomposed into different sections.

First, we introduce a preamble with the key information about the project, such as the people involved, the date, and the objective. Second, we load everything we need. This avoids loading a package multiple times, makes it easy to tell another member which packages must be loaded, and lets you list all package versions to ensure full reproducibility. Third, we load the data. Then, we run the analysis.

Naming files matters. When you open a coding directory, you want to understand at a glance how it is structured. That's why we advise storing your subscripts in subdirectories with explicit names such as descriptive_statistics, data, or econometric_analysis. It helps to navigate the directory and understand the code and the underlying choices (which is, in the end, the main thing). Within each subdirectory, we advise numbering the coding files to indicate the order in which they must be executed. In the same fashion, object names within the scripts should be as clear as possible so that team members understand what is what (but remember that comments help too!).

Finally, we can sum up the structure of the project as follows in a more general manner.

  • code
    • README.md
    • main_script.R
    • logfile.md
    • descriptive_statistics
      • 01_stat_code.R
      • 02_stat_code.R
    • loading_data
      • 01_loading_data.R
      • 02_filter_data.R
    • econometric_analysis
      • 01_diff_in_diff.R
      • 02_robustness.R
  • output_code
    • figure
      • fig_stat_des.png
      • fig_stat_des_sample.png
    • table
      • tab_summary_stat.tex
      • tab_main_results.tex
  • code
    • README.md
    • main_script.do
    • logfile.md
    • descriptive_statistics
      • 01_stat_code.do
      • 02_stat_code.do
    • loading_data
      • 01_loading_data.do
      • 02_filter_data.do
    • econometric_analysis
      • 01_diff_in_diff.do
      • 02_robustness.do
  • output_code
    • figure
      • fig_stat_des.png
      • fig_stat_des_sample.png
    • table
      • tab_summary_stat.tex
      • tab_main_results.tex

Here, we have two main directories (code and output_code). The second contains all figures and tables produced by the scripts, which makes it easy to locate output files.

Storing data

GitHub is a platform to share and collaborate on code. But what about data? Generally speaking, we should avoid storing any data on it for two reasons: the sensitivity of the data, and the fact that Git is not designed for large or binary files.

To store data securely, you can use the PSE NextCloud solution. Each member of the EU Tax Observatory has access to PSE NextCloud, which offers a cloud solution for storing files, documents, or data. It is similar to common cloud services such as Dropbox, but the servers are located in Paris, within PSE. Hence, it provides a GDPR-compliant way to store and share data between members. You can improve the security level of shared folders by adding passwords or expiration dates.

Note

In R, there is a package to link your coding repository with your NextCloud data repo.

Besides the storage aspect, it is important to store data in an easily readable format, such as CSV or Parquet, to ensure an easy opening for other members or replicators. Proprietary formats such as Excel should be avoided.

Code review

Finally, mistakes in coding files are common. Performing a code review from time to time is good practice to ensure that everything runs as planned. For instance, you might add a stray 0 when filtering data, which changes the composition of your sample; leave in a filter that was introduced early on to simplify the data processing; or forget to comment useful parts of the code. Code reviews help track and correct these potential mistakes. When writing a coding file, you don't necessarily have the hindsight to identify all issues (just as a first draft of a paper contains inconsistencies throughout).

A code review simply consists of reading the entire code sequentially and making sure that everything is fine. First, check that your code runs without errors; if it does not, correct it. Second, track any typo-like issues, such as a wrong filtering threshold or column assignment, that might affect the results. Third, you may find your code too complex: if you can simplify it through partial rewriting, do it. Even if comments help explain your process, the simpler your code is, the more understandable it is.

Software versioning

Finally, a point that is often omitted in reproducible coding is software and package versioning. Packages and software evolve, which might break your code: for instance, code written in 2019 might no longer run with current packages. Hence, we need to specify the version of the main software (R, Python, or Stata) as well as the versions of the packages used. For R, the following code reports software and package versions.

# it provides information about the R version being used
# also, it lists all packages being installed in your session
sessionInfo() 
R version 4.5.1 (2025-06-13)
Platform: aarch64-apple-darwin24.4.0
Running under: macOS Sequoia 15.6

Matrix products: default
BLAS:   /opt/homebrew/Cellar/openblas/0.3.30/lib/libopenblasp-r0.3.30.dylib 
LAPACK: /opt/homebrew/Cellar/r/4.5.1/lib/R/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Paris
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.5.1    fastmap_1.2.0     cli_3.6.5        
 [5] tools_4.5.1       htmltools_0.5.8.1 yaml_2.3.10       rmarkdown_2.29   
 [9] knitr_1.50        jsonlite_2.0.0    xfun_0.52         digest_0.6.37    
[13] rlang_1.1.6       evaluate_1.0.4   
# it returns the current version of the package of interest
packageVersion("data.table")
[1] '1.17.8'
* Display Stata version information
version

* Display information about installed packages
ado dir

* To get more detailed information about a specific package, you can use:
ado describe <package_name>

Here, the current version of R is 4.5.1, whereas the version of the data.table package is 1.17.8. You need to share this information to ensure that the results are fully reproducible, even in 10 years. An alternative is to use a Dockerfile, which provides all relevant files and packages to run your script and replicate your results.
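As a sketch of the Docker approach (the rocker project provides images pinned to a given R version; the package version matches the session shown above, and the file names are those of our example project):

```dockerfile
# pin the R version through a versioned image from the rocker project
FROM rocker/r-ver:4.5.1

# install a pinned version of the main package
# (remotes::install_version fetches a specific release from CRAN)
RUN R -e "install.packages('remotes'); remotes::install_version('data.table', version = '1.17.8')"

# copy the project code and define the entry point
COPY . /project
WORKDIR /project
CMD ["Rscript", "main_script.R"]
```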

Additional resources about coding

You can find some useful resources here: